- Newsgroups: comp.sources.misc
- From: Martin Fouts <fouts@clipper.ingr.com>
- Subject: v21i038: cloops - Livermore Loops in C, Part03/03
- Message-ID: <1991Jul25.020740.28375@sparky.IMD.Sterling.COM>
- X-Md4-Signature: 66bbb34c53801c38f1ef1b86799174b7
- Date: Thu, 25 Jul 1991 02:07:40 GMT
- Approved: kent@sparky.imd.sterling.com
-
- Submitted-by: Martin Fouts <fouts@clipper.ingr.com>
- Posting-number: Volume 21, Issue 38
- Archive-name: cloops/part03
- Environment: Cray2, Alliant, Convex, Amdahl, Sun, SGI
-
- ----- cut here for Part03
- #! /bin/sh
- # This is a shell archive. Remove anything before this line, then unpack
- # it by saving it into a file and typing "sh file". To overwrite existing
- # files, type "sh file -c". You can also feed this as standard input via
- # unshar, or by typing "sh <file", e.g.. If this archive is complete, you
- # will see the following message at the end:
- # "End of archive 3 (of 3)."
- # Contents: cloops.tex
- # Wrapped by fouts@bozeman on Sun Jul 21 18:23:21 1991
- PATH=/bin:/usr/bin:/usr/ucb ; export PATH
- if test -f 'cloops.tex' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'cloops.tex'\"
- else
- echo shar: Extracting \"'cloops.tex'\" \(28304 characters\)
- sed "s/^X//" >'cloops.tex' <<'END_OF_FILE'
- X% C-Loops
- X%
- X% A Paper Describing the C version of the Livermore Loops
- X% Copyright (C) 1991 Martin Fouts
- X%
- X% This program is free software; you can redistribute it and/or modify
- X% it under the terms of the GNU General Public License as published by
- X% the Free Software Foundation; either version 1, or (at your option)
- X% any later version.
- X%
- X% This program is distributed in the hope that it will be useful,
- X% but WITHOUT ANY WARRANTY; without even the implied warranty of
- X% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- X% GNU General Public License for more details.
- X%
- X% You should have received a copy of the GNU General Public License
- X% along with this program; if not, write to the Free Software
- X% Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- X%
- X
- X\documentstyle[12pt,twoside,fleqn,openbib]{article}
- X\pagestyle{myheadings}
- X\markboth{DRAFT of \today}{DRAFT of \today}
- X
- X\title{ The Livermore Loops in C
- X\\ DRAFT of \today }
- X
- X\author{
- XMartin Fouts \\
- XMathematician \\
- XNAS Systems Development Branch \\
- XNASA Ames Research Center}
- X
- X
- X\begin{document}
- X\maketitle
- X%\tableofcontents
- X%\vfill
- X%\pagebreak
- X\markboth{DRAFT of \today}{DRAFT of \today}
- X
- X\begin{abstract}
- X
- XThe Livermore Loops were transliterated from Fortran into C. The
- Xresulting program was run on several machines which support both a C
- Xand Fortran compiler. The effort taken in porting the loops is
- Xdiscussed, and the results are presented. Some of the more
- Xinteresting results are analyzed. Problems which arise from
- Xusing the Loops as a benchmark suite are discussed. Some conclusions
- Xabout numerical computation in C are drawn.
- X
- X\end{abstract}
- X
- X\section{Background}
- X
- XNumerical calculations, especially those involving vector operations,
- Xhave long been the territory of Fortran, and it is widely accepted
- Xthat no other high level language offers the same opportunity for
- Xcompiler optimization of source code.
- X
- XThe Livermore Loops\cite{McMahon86} are widely used as a benchmark of
- Xnumerically intensive calculation in Fortran, and results are available
- Xfor a wide range of machines. The loops were transliterated from
- XFortran into C and results were gathered for several machines. On
- Xall machines the C compiler produces faster overall performance.
- XBoth languages calculated the same results.
- X
- XIn performing the conversion, some difficulties using the Livermore
- XLoops as a benchmark suite were uncovered. They include the ease with
- Xwhich some compilers optimize away timing loops and the lack of
- Xsensitivity in the validation routines. These difficulties are
- Xdiscussed in the sections which follow.
- X
- XSome semantic differences between C and Fortran were uncovered which
- Xhave an effect on the precision of the results generated by the
- Xprogram on certain machines. These differences are discussed in
- Xsection 3.
- X
- XThis experiment makes an interesting comment on the state of
- Xoptimizing compilers, the ability to write Fortran in any language,
- Xand the utility of the Livermore Loops as a vector benchmark.
- X
- XFor the purpose of this paper, Fortran refers to the Fortran language
- Xas described in the Fortran-77 standard\cite{ANSI77} and C refers to
- Xthe programming language as described in \cite{Kern78}. References to
- Xspecific dialects will be labeled as such.
- X
- X\section{Overview of Transliteration}
- X
- XIn converting the Livermore Loops from Fortran to C, the major goal
- Xwas to preserve the original Fortran semantics. Some changes were
- Xnecessary, including:
- X
- X\begin{itemize}
- X
- X\item The routine \tt SPACE \rm was deleted. Since the pointer
- Xassignment isn't used in the Fortran implementation, this has no
- Xeffect on the final results.
- X
- X\item The variable \tt lp \rm which controls the number of passes was
- Xmade an input parameter. The default is 100 passes. When the default
- Xis used, standard results are obtained.
- X
- X\item In kernel 15, arithmetic if statements were modified to
- Xif-then-else form, following the pattern sketched after this list.
- X
- X\item All occurrences of the form \tt x**2 \rm were replaced by \tt
- Xx*x.
- X
- X\item \tt REPORT \rm was truncated. Only the flop rates and averages
- Xare reported. Since the C version is only intended as a comparison to
- XFortran on the same machine, the analysis portion is not needed.
- X
- X\item \tt SIGNAL \rm was reimplemented as several separate routines to
- Xwork around differences in C and Fortran parameter passing. Further,
- X\tt VECTOR \rm was modified to call \tt lsignalN \rm for each array,
- Xsince C has no equivalent to a \tt COMMON \rm block. \tt SIGNAL \rm
- Xwas renamed to \tt lsignalN \rm where \tt N \rm is replaced by 0, 1,
- X2, or 3 to represent the number of dimensions in the array being
- Xinitialized. The separate routines are not strictly needed, but
- Xclarify the intent of the code.
- X
- XSince \tt SIGNAL \rm is not part of the measured code, this has no
- Xeffect on the reported performance.
- X
- X\item The number generation routine used by \tt SIGNAL \rm was
- Xcarefully reimplemented to duplicate Fortran numeric semantics.
- X
- X\item \tt SUMO \rm was reimplemented as \tt sumo, sumo2, \rm and \tt
- Xsumo3 \rm, depending on the number of dimensions in the array being
- Xsummed. As with \tt SIGNAL, \rm this was done for clarity and is in
- Xcode which is not part of that being measured.
- X
- X\item A command line option was added allowing only specified loops to
- Xbe run. This was done for debugging. The reported results are from a
- Xsingle run in which all loops were performed.
- X
- X\item All array references were converted from the 1 based subscript
- Xused by Fortran to 0 based subscripts as used by C. This was done to
- Xavoid holes in the C arrays which would have resulted in different
- Xmemory use patterns. This can have an occasional small adverse
- Ximpact on the C version, since some array references now involve
- Xsubtracting one from the indices.
- X
- X\end{itemize}
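- X
- XAs an illustration of the arithmetic if rewriting, the following
- Xsketch shows the general pattern in C. It is schematic only; the
- Xvariable \tt e \rm and the bodies are placeholders, not the actual
- Xkernel 15 code.
- X
- X\begin{verbatim}
- X
- X/* Fortran arithmetic if:   IF (E) 10, 20, 30                  */
- X/* branches to label 10, 20, or 30 as E is negative, zero, or  */
- X/* positive.  The transliteration uses an if-then-else chain:  */
- X
- Xif (e < 0.0) {
- X    /* statements reached through label 10 */
- X} else if (e == 0.0) {
- X    /* statements reached through label 20 */
- X} else {
- X    /* statements reached through label 30 */
- X}
- X
- X\end{verbatim}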
- X
- XA major change which wasn't made was transposing array indices.
- X
- X\section{Language Differences Encountered}
- X
- XThree major differences between Fortran and C were encountered which
- Xhave an impact on the transliteration.
- X
- X\begin{itemize}
- X
- X\item Different array index bases are used by the two languages.
- X\item Fortran uses \tt COMMON \rm blocks to allocate global data,
- Xwhile C merely requires declaring it.
- X\item C and Fortran have different rules for arithmetic precision. Some
- Xminor differences also exist.
- X
- X\end{itemize}
- X
- XEach of these will be discussed in turn.
- X
- XThe array index basis difference has two implications. One of these is
- Xthe way in which indices relate to memory locations in multiple
- Xdimension arrays, and the other is the difference in array basing.
- XC arrays are always indexed from 0 to \tt N - 1 \rm where \tt N \rm is
- Xthe number of elements declared for that index of the array. The
- Xdefault basing in Fortran is to index from 1 to \tt N, \rm although 0
- Xbasing is an option. All of the arrays in the Livermore Loops are 1
- Xbased. In the transliteration, this was explicitly adjusted for. In
- Xthe main, this consists of adjusting loop indices to run from 0 to \tt
- XN - 1 \rm rather than from 1 to \tt N \rm and subtracting 1 from
- Xconstant indices as part of the transliteration.
- X
- XThis can also be a problem when integer values are calculated by a
- Xprogram and then used as loop indices. An attempt was made to find
- Xall such places and convert them appropriately. In a few cases, the
- Xcalculations were left alone, and 1 was subtracted when the integer
- Xvariable was used as a subscript.
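- X
- XBoth adjustments are illustrated by the sketch below, which is
- Xschematic only; the array names and the computed subscript are
- Xplaceholders rather than code taken from the kernels.
- X
- X\begin{verbatim}
- X
- X/* Fortran original:                                            */
- X/*       DO 10 K = 1, N                                         */
- X/*  10   X(K) = Y(IX(K))                                        */
- X/*                                                              */
- X/* C transliteration: the loop index runs from 0 to N - 1, and  */
- X/* a subscript still computed in Fortran (1 based) terms has 1  */
- X/* subtracted at the point of use.                              */
- X
- Xfor (k = 0; k < n; k++)
- X    x[k] = y[ix[k] - 1];
- X
- X\end{verbatim}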
- X
- XThe other difference in arrays has to do with multiple dimension
- Xarrays. For example, the Fortran array
- X
- X\begin{verbatim}
- X DIMENSION A(2,2)
- X\end{verbatim}
- X
- X\noindent
- Xis stored in ascending memory addresses as \tt A(1,1), A(2,1), A(1,2),
- XA(2,2) \rm from low address to high address. The equivalent C array
- X
- X\begin{verbatim}
- X float a[2][2];
- X\end{verbatim}
- X
- X\noindent
- Xis stored as \tt a[0][0], a[0][1], a[1][0], a[1][1] \rm from low
- Xaddress to high address. A complete transliteration would transpose
- Xindices from left to right to preserve the actual memory ordering.
- XThis was not done in this transliteration. Assuming that the original
- XFortran had been carefully coded for memory access, this would cause a
- Xdegradation of performance of the C code on some systems.
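- X
- XFor illustration, a transposed transliteration of a simple
- Xtwo-dimensional update would look like the following sketch
- X(hypothetical arrays, not taken from the loops); the C inner loop
- Xthen walks the same consecutive memory locations as the Fortran
- Xinner loop does.
- X
- X\begin{verbatim}
- X
- X/* Fortran, inner loop over the first (fastest varying) index:  */
- X/*       DO 10 J = 1, N                                         */
- X/*       DO 10 I = 1, N                                         */
- X/*  10   A(I,J) = B(I,J) + C(I,J)                               */
- X/*                                                              */
- X/* A complete transliteration would swap the indices so that    */
- X/* the C inner loop also moves through consecutive addresses:   */
- X
- Xfor (j = 0; j < n; j++)
- X    for (i = 0; i < n; i++)
- X        a[j][i] = b[j][i] + c[j][i];
- X
- X\end{verbatim}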
- X
- XThe Livermore Loops utilize the fairly common Fortran practice of
- Xincluding a number of related items together in a \tt COMMON \rm
- Xblock in order to make global references to those items possible.
- X
- XFortran requires that items which are listed adjacently in a \tt
- XCOMMON \rm block be allocated adjacently in memory. One side effect
- Xof this requirement is that an entire \tt COMMON \rm block can be
- Xinitialized by treating it as a single array in an initialization
- Xroutine. Figure~1 shows an example in Fortran.
- X
- X\begin{figure}[ht]
- X\small
- X\caption{\tt COMMON \rm usage in Fortran}
- X\begin{verbatim}
- X
- XC
- XC THIS ROUTINE USES THREE ELEMENTS OF A COMMON BLOCK BY NAME
- XC
- X      SUBROUTINE AROUTINE
- X      COMMON /CBLOCK/ A1, B1, C1
- XC
- XC USE A1, B1, C1
- XC
- X      RETURN
- X      END
- XC
- XC THIS ROUTINE INITIALIZES THE COMMON BLOCK USING AN ARRAY
- XC
- X      SUBROUTINE INIT
- X      COMMON /CBLOCK/ A(3)
- X      DO 10 I = 1, 3
- X   10 A(I) = I
- X      RETURN
- X      END
- X
- X\end{verbatim}
- X\end{figure}
- X
- XThe initialization routines in the Livermore Loops take advantage of
- Xthis. There are several \tt COMMON \rm blocks, each of which contains
- Xmany arrays. There is an initialization routine which treats each \tt
- XCOMMON \rm block as a single array, and initializes it in a single
- Xloop.
- X
- XIt is possible to duplicate the adjacent storage effect of \tt COMMON
- X\rm blocks in C. Figure~2 shows an example of this usage.
- X
- X\begin{figure}[ht]
- X\small
- X\caption{Translating \tt COMMON \rm usage to C}
- X\begin{verbatim}
- X
- X#include <stdio.h>
- X
- Xstruct cblock {
- X float A_[3];
- X} CBLOCK;
- X
- X#define A1 CBLOCK.A_[0]
- X#define B1 CBLOCK.A_[1]
- X#define C1 CBLOCK.A_[2]
- X#define A CBLOCK.A_
- X
- Xvoid init()
- X{
- X int i;
- X for (i = 0; i < 3; i++)
- X A[i] = i;
- X}
- X
- Xmain()
- X{
- X init();
- X fprintf(stdout,"A1 = %f, B1 = %f, C1 = %f\n", A1, B1, C1);
- X}
- X
- X\end{verbatim}
- X\end{figure}
- X
- XThis technique can be extended within the \tt struct \rm to duplicate
- Xany naming which comes up within a \tt COMMON \rm block. In
- Xparticular, multiple arrays can be implemented by using either unions
- Xand preprocessor defines, if they do not overlap, or names and
- Xpreprocessor macro functions if they do overlap.
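- X
- XThe union variant is sketched below for a hypothetical block
- Xcontaining two arrays. The block and member names are invented for
- Xthe example, and the sketch assumes that the compiler inserts no
- Xpadding between the \tt float \rm members.
- X
- X\begin{verbatim}
- X
- X/* Hypothetical COMMON block:    COMMON /CBLOCK2/ P(2), Q(3)    */
- X/* The union allows the same storage to be viewed either as the */
- X/* individual arrays or as one flat array for initialization.   */
- X
- Xunion cblock2 {
- X    struct {
- X        float P_[2];
- X        float Q_[3];
- X    } m;
- X    float all[5];
- X} CBLOCK2;
- X
- X#define P   CBLOCK2.m.P_
- X#define Q   CBLOCK2.m.Q_
- X#define ALL CBLOCK2.all      /* whole block as a single array */
- X
- X\end{verbatim}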
- X
- XBecause the only use of the \tt COMMON \rm blocks within the Livermore
- XLoops is to provide global data and to simplify initialization and
- Xchecksum calculation, this approach was not taken. Instead, each
- Xvariable was declared as a C global, and the initialization routines
- Xwere modified to explicitly initialize each array separately. This
- Xhas an impact on both the initialization and checksum software.
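- X
- XA minimal sketch of this arrangement follows. The array names and
- Xlengths are placeholders, and the declaration shown for \tt lsignal1
- X\rm is an assumed form of the one-dimensional initializer described
- Xearlier.
- X
- X\begin{verbatim}
- X
- X/* Former COMMON members become ordinary C globals.             */
- Xdouble x[1001], y[1001], z[1001];
- X
- Xvoid lsignal1();             /* assumed per-array initializer  */
- X
- Xvoid init_vectors()
- X{
- X    /* one explicit call per array, replacing the single loop   */
- X    /* over a whole COMMON block used in the Fortran version    */
- X    lsignal1(x, 1001);
- X    lsignal1(y, 1001);
- X    lsignal1(z, 1001);
- X}
- X
- X\end{verbatim}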
- X
- XOn machines for which \tt double \rm provides more precision than \tt
- Xfloat \rm in C, numerical calculations are done differently in C than
- Xin Fortran. When all of the variables involved in a calculation are
- Xof the same precision, Fortran calculates all intermediate results to
- Xthat precision. C promotes all variables to the precision of \tt
- Xdouble, \rm does all calculations in double and converts to the
- Xprecision of the result. This yields different numerical results when
- Xthere are subexpressions, for instance in the case
- X
- X\begin{verbatim}
- Xresult = a * b - c * d
- X\end{verbatim}
- X
- XFortran would calculate the two products, convert to the appropriate
- Xprecision, perform the subtraction, and store the result. C would
- Xcalculate the two products in \tt double \rm precision, perform a \tt
- Xdouble \rm precision subtraction, and then convert the result. This
- Xcan yield quite different numbers.
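- X
- XThe effect can be reproduced directly. In the sketch below, which is
- Xillustrative only and uses arbitrary values, the casts force each
- Xproduct back to \tt float \rm precision before the subtraction,
- Xapproximating the Fortran evaluation, while the uncast expression
- Xfollows the default C promotion rules; the two stored results can
- Xdiffer in their low order digits.
- X
- X\begin{verbatim}
- X
- X#include <stdio.h>
- X
- Xmain()
- X{
- X    float a = 1.0e-4, b = 3.0, c = 1.0e-4, d = 2.0;
- X    float fortran_like, c_default;
- X
- X    /* each product rounded to float before the subtraction */
- X    fortran_like = (float)(a * b) - (float)(c * d);
- X
- X    /* products and subtraction carried out in double;      */
- X    /* only the final result is rounded to float            */
- X    c_default = a * b - c * d;
- X
- X    printf("%.10g  %.10g\n", fortran_like, c_default);
- X    return 0;
- X}
- X
- X\end{verbatim}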
- X
- XThe routine which generates the initial data was carefully coded to
- Xmimic Fortran conversion, so that the initial data would be the same
- Xin the C case as in the Fortran case, and care was taken to see that
- Xsubroutine calls were performed with the correct argument types. The
- Xtest routines themselves, however, calculate their results using the
- Xdefault C promotion rules.
- X
- X\section{Verifying the Transliteration}
- X
- XThe Convex compilers provide statement counting as a profiler option.
- XThe Fortran and C versions of the loops were run with no optimization
- Xand with statement profiling enabled. The statement counts were
- Xcompared on a statement by statement basis for the two programs. All
- Xloops were found to agree exactly.
- X
- XOnce the statement counts were completed, the two programs were run on
- Xthe Cray 2, which has the same precision for \tt float \rm and \tt
- Xdouble \rm in C. The Fortran version was compiled with CFT77 release
- X3.0 with all optimization disabled. The C version was compiled with
- Xthe UniCos 4.0 C compiler.
- X
- XLoops 1 through 5, 7, 11, 12, 13, 14, 16, 17, 20, and 24 produce
- Xidentical checksums to 12 decimal places, which is nearly the limit of
- Xfloating point precision on the Cray 2. Loops 6, 8, 9, 10, 15, 18,
- X21 and 23 all match to at least five decimal places. Careful
- Xinspection of the loops was made by running the Fortran and C versions
- Xsimultaneously under a source level debugger on a Silicon Graphics
- XIris 4D workstation. Several additional bugs were corrected, but no
- Xdifference in the numerical results was found. It is suspected that
- Xthe remaining differences are due to bugs in the way in which the
- Xchecksums are calculated under C.
- X
- X\section{Overview of Results}
- X
- XThe tables show the reported megaflop rates for the C and Fortran
- Xversions of the Livermore Loops on several machines available to the
- Xauthor. Of these, the Iris and the Sun were chosen as representative
- Xof the current generation of engineering workstations, the Amdahl as
- Xrepresentative of current mainframes, the Convex and Alliant as
- Xrepresentative of two different approaches to minisupercomputers, and
- Xthe Cray 2 as a representative supercomputer. All of the machines run
- Xrecent releases of the vendor's version of the Unix operating system.
- X
- XFor all machines, the overall performance of the C compiler was better
- Xthan the overall performance of the Fortran compiler. For the mainly
- XUnix systems, this was not surprising, but it was surprising on the
- XCray 2. A detailed comparison of the optimizations done by the two
- XCray compilers on loops which were significantly different is
- Xpresented below.
- X
- X\begin{table}[htbp]
- X\caption{Results in Megaflops: Alliant, Amdahl, and Cray 2}
- X\small
- X\begin{tabular}{|r|rr|rr|rrr|}
- X\hline
- X{\bf Manufacturer} & \multicolumn{2}{|c|}{Alliant} &
- X\multicolumn{2}{|c|}{Amdahl} & \multicolumn{3}{|c|}{Cray} \\
- X{\bf Model} & \multicolumn{2}{|c|}{FX/8} & \multicolumn{2}{|c|}{5880} &
- X\multicolumn{3}{|c|}{2} \\
- X{\bf Compiler} & cc & fortran & cc & f77 & cc (vector) & cft77 (vector) &
- Xcft77 (scalar) \\
- X\hline
- X1 & 10.72 & 15.02 & 15.02 & 1.88 & 764.77 & 96.89 & 13.53 \\
- X2 & 7.76 & 0.78 & 11.64 & 1.16 & 30.28 & 18.45 & 6.40 \\
- X3 & 7.51 & 1.00 & 15.02 & 1.72 & 21.39 & 69.57 & 11.29 \\
- X4 & 6.55 & 0.72 & 9.00 & 1.03 & 18.32 & 32.10 & 4.16 \\
- X5 & 7.06 & 0.71 & 12.00 & 1.09 & 20.36 & 9.85 & 10.67 \\
- X6 & 5.95 & 0.49 & 10.82 & 1.01 & 21.29 & 9.08 & 3.90 \\
- X7 & 15.92 & 19.10 & 18.73 & 2.51 & 929.28 & 147.31 & 16.35 \\
- X8 & 8.07 & 14.26 & 11.56 & 1.07 & 41.67 & 113.39 & 17.55 \\
- X9 & 14.72 & 10.30 & 25.75 & 1.72 & 748.91 & 113.07 & 13.88 \\
- X10 & 4.54 & 5.45 & 9.09 & 0.55 & 21.40 & 35.32 & 6.68 \\
- X11 & 4.29 & 0.43 & 7.50 & 0.75 & 11.00 & 9.02 & 7.57 \\
- X12 & 4.00 & 5.99 & 6.66 & 0.60 & 246.53 & 30.61 & 5.24 \\
- X13 & 1.68 & 0.27 & 4.48 & 0.24 & 6.44 & 1.75 & 1.66 \\
- X14 & 3.06 & 1.16 & 5.46 & 0.52 & 14.61 & 7.22 & 3.88 \\
- X15 & 5.29 & 2.54 & 5.86 & 0.63 & 23.05 & 6.94 & 6.66 \\
- X16 & 4.54 & 0.45 & 10.60 & 1.59 & 15.28 & 4.00 & 3.86 \\
- X17 & 7.79 & 1.82 & 13.63 & 2.73 & 36.74 & 8.08 & 7.81 \\
- X18 & 10.62 & 7.26 & 9.07 & 0.88 & 64.29 & 100.59 & 11.61 \\
- X19 & 7.27 & 0.73 & 12.12 & 1.21 & 27.08 & 12.40 & 11.34 \\
- X20 & 10.91 & 1.47 & 17.93 & 2.40 & 50.41 & 12.33 & 11.39 \\
- X21 & 7.13 & 1.82 & 6.07 & 0.60 & 15.16 & 23.43 & 6.13 \\
- X22 & 6.06 & 5.15 & 8.59 & 0.86 & 47.05 & 90.76 & 5.94 \\
- X23 & 7.10 & 0.93 & 9.33 & 0.91 & 30.69 & 9.41 & 8.99 \\
- X24 & 3.53 & 6.00 & 10.00 & 1.00 & 8.32 & 1.95 & 1.87 \\
- XHarmonic Mean & 5.59 & 1.09 & 9.37 & 0.87 & 22.38 & 9.26 & 5.7 \\
- X\hline
- X\end{tabular}
- X\end{table}
- X
- X
- X\begin{table}[htbp]
- X\caption{Results in Megaflops: Convex, Silicon Graphics, and Sun}
- X\small
- X\begin{tabular}{|r|rr|rr|rr|}
- X\hline
- X{\bf Manufacturer} & \multicolumn{2}{|c|}{Convex} &
- X\multicolumn{2}{|c|}{Silicon Graphics} & \multicolumn{2}{|c|}{Sun} \\
- X{\bf Model} & \multicolumn{2}{|c|}{C1} & \multicolumn{2}{|c|}{Iris 4D} &
- X\multicolumn{2}{|c|}{Sun 3/250} \\
- X{\bf Compiler} & vc & fc & Mips cc & Mips f77 & cc & f77 \\
- X\hline
- X1 & 1102.90 & 175.23 & 8.94 & 1.32 & 1.35 & 0.18 \\
- X2 & 8.49 & 1.57 & 7.76 & 1.29 & 1.11 & 0.12 \\
- X3 & 5.82 & 9.57 & 6.90 & 1.33 & 1.10 & 0.20 \\
- X4 & 7.37 & 1.34 & 6.00 & 0.71 & 0.99 & 0.11 \\
- X5 & 6.81 & 1.32 & 5.00 & 0.69 & 1.05 & 0.11 \\
- X6 & 4.16 & 3.80 & 5.67 & 0.86 & 0.97 & 0.12 \\
- X7 & 1811.50 & 331.52 & 13.38 & 1.75 & 1.93 & 0.18 \\
- X8 & 10.82 & 247.95 & 12.73 & 1.25 & 1.41 & 0.16 \\
- X9 & 1428.90 & 195.82 & 13.21 & 1.56 & 1.66 & 0.16 \\
- X10 & 2.57 & 7.15 & 8.26 & 0.91 & 0.74 & 0.09 \\
- X11 & 4.72 & 2.18 & 3.57 & 0.63 & 0.72 & 0.08 \\
- X12 & 2.55 & 16.34 & 3.57 & 0.62 & 0.75 & 0.08 \\
- X13 & 1.73 & 0.56 & 2.24 & 0.14 & 0.45 & 0.05 \\
- X14 & 4.24 & 1.34 & 4.27 & 0.37 & 0.73 & 0.08 \\
- X15 & 3.94 & 1.65 & 5.14 & 0.61 & 1.41 & 0.23 \\
- X16 & 5.18 & 0.88 & 7.57 & 0.88 & 1.18 & 0.17 \\
- X17 & 7.55 & 1.31 & 11.36 & 1.82 & 0.99 & 0.16 \\
- X18 & 8.50 & 11.38 & 9.99 & 1.12 & 1.35 & 0.16 \\
- X19 & 7.89 & 1.83 & 5.51 & 1.01 & 0.96 & 0.12 \\
- X20 & 7.17 & 1.60 & 10.83 & 1.42 & 1.72 & 0.25 \\
- X21 & 3.97 & 7.66 & 6.93 & 0.91 & 0.84 & 0.12 \\
- X22 & 5.60 & 101.18 & 6.87 & 0.90 & 1.94 & 0.32 \\
- X23 & 6.59 & 2.52 & 9.07 & 1.05 & 1.21 & 0.14 \\
- X24 & 90.46 & 9.16 & 4.35 & 0.63 & 1.13 & 0.14 \\
- XHarmonic Mean & 5.62 & 2.37 & 6.09 & 0.72 & 1.02 & 0.13 \\
- X\hline
- X\end{tabular}
- X\end{table}
- X
- X\section{Alliant Performance}
- X
- XThe Alliant Fortran compiler performs analysis to determine if a loop
- Xis vectorizable or can be executed concurrently on multiple
- Xprocessors. If the loop can be executed concurrently and in vector
- Xmode, it tries to find a way to use both vectorization and
- Xconcurrency simultaneously. The Alliant C compiler performs neither
- Xvectorization nor concurrency analysis and always produces scalar
- Xsingle processor code. The Alliant Fortran compiler suffers from
- Xtrying too hard. Sometimes it produces vector or concurrent code for
- Xwhich the overhead of the code exceeds the gain from the facility. In
- Xthese cases, such as loops 3, 9, and 13, the scalar C code performs
- Xmore efficiently than the optimized Fortran code. In other cases,
- Xsuch as loops 1 and 7, the Fortran code outperforms the C code. In no
- Xcase does the optimization give a speedup of more than 2 over the
- Xscalar code.
- X
- X\section{Amdahl Performance}
- X
- XThe Amdahl Fortran compiler is a straightforward port of the BSD Unix
- Xf77 compiler. It does not appear to have been tuned to the Amdahl
- Xarchitecture at all and generates very poor code. The Amdahl C
- Xcompiler is better tuned to the architecture, and in some cases Amdahl
- Xscalar code performance approaches that of the Cray 2. Because of the
- Xlarge difference between the compilers, a detailed comparison was not
- Xundertaken.
- X
- X\section{Cray 2 Performance}
- X
- XThere are four loops (1, 7, 9, and 12) for which C achieves
- X``impossible'' performance on the Cray 2, in that the reported rate
- Xexceeds the maximum performance of a single processor. These results
- Xare an artifact of the benchmark. C outperforms the Fortran code on
- Xmost scalar loops (5, 11, 13, 16, 17, 19, 20, and 24), with one
- Xexception (22). Fortran recognizes more loops as vectorizable than C,
- Xand produces better performance on the one loop (18) which both
- Xcompilers vectorize. Performance is mixed on those loops which
- XFortran vectorizes and C does not. In some cases (2, 6, 14, 15, and
- X23) the scalar C code is more efficient than the vector Fortran code.
- XIn the remaining cases (3, 4, 8, 10, and 21) the vector Fortran code
- Xis more efficient than the scalar C code.
- X
- XThe impossible performance of C is a result of an artifact of the
- Xbenchmarks. To obtain a large enough time interval for meaningful
- Xmeasurement, all of the loops are of the form:
- X
- X\begin{verbatim}
- X
- XFOR I = 1 TO NUMBER_OF_PASSES DO
- X BEGIN
- X do_the_loop_in_question
- X END
- X
- X\end{verbatim}
- X
- XIn the impossible cases, the loop in question is of a form similar to:
- X
- X\begin{verbatim}
- X
- XFOR J = 1 to ARRAY_LENGTH DO
- X BEGIN
- X X[J] := Y[J] op Z[J];
- X END
- X
- X\end{verbatim}
- X
- XIn this very simple case, it is possible for a reasonable compiler to
- Xdetermine that the entire inner loop is invariant with respect to the
- Xouter loop, move the inner loop outside the outer loop, and then throw
- Xthe outer loop away because it doesn't compute anything. If the
- Xvariable on the left side of the equation never appears on the right
- Xside, the loop is invariant and can be collapsed. Close inspection
- Xshows that the performance advantage of C on Loops 1, 7, 9, and 12 can
- Xbe explained as a result of this optimization of the outer loop. The
- XFortran compiler misses this optimization.
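- X
- XThe situation can be illustrated with a minimal C sketch of the
- Xtiming structure; the names and sizes are placeholders and the code
- Xis not taken from the benchmark source. Because nothing in the inner
- Xloop depends on the repetition counter, a sufficiently aggressive
- Xcompiler may hoist the inner loop and collapse the repetitions to a
- Xsingle pass, inflating the reported rate.
- X
- X\begin{verbatim}
- X
- X#define N  990               /* placeholder vector length      */
- X#define LP 100               /* placeholder number of passes   */
- X
- Xdouble x[N], y[N], z[N];
- X
- Xvoid timed_kernel()
- X{
- X    int l, k;
- X
- X    /* The outer loop exists only to lengthen the measured      */
- X    /* interval.  The inner loop writes x[] from y[] and z[]    */
- X    /* and never reads x[], so every pass computes the same     */
- X    /* values: the entire outer loop is invariant.              */
- X    for (l = 0; l < LP; l++)
- X        for (k = 0; k < N; k++)
- X            x[k] = y[k] * z[k];
- X}
- X
- X\end{verbatim}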
- X
- XThe original Fortran version of Loop 2 contains a compiler directive
- Xto force vectorization. The C version does not force vectorization,
- Xbut instead produces scalar code. Further, the vector length is 101,
- Xwhich is fairly short compared to the overhead introduced by setting
- Xup the vector loop. C produces faster code for this case, as a result
- Xof not vectorizing the loop. Loop 6 shows a similar phenomenon. In
- Xthis case, Fortran recognizes a conditional dependency and generates
- Xcode to use vector instructions when the dependency does not occur.
- XAgain the overhead is not justified and the C version is faster. This
- Xalso happens in loops 14 and 23.
- X
- XIn other cases, the Fortran compiler vectorizes and the C compiler
- Xdoes not, and the Fortran code runs faster than the C code. The
- Xperformance improvement varies between 1.5 and 3.2 times.
- X
- XIn all but one of the cases in which both languages generate scalar
- Xcode, C generates more efficient code than Fortran. The exception is
- Xloop 22. Fortran generates inline code for the \tt exp \rm
- Xfunction, while C generates calls to a subroutine. Although current C
- Xcompilers don't usually generate inline code, the ANSI draft standard
- Xprovides a mechanism for defining and implementing inlines, and some
- Xcompilers do support the mechanism.
- X
- XOn the Cray 2, C generates shorter, less convoluted loops and uses more
- Xregister\footnote{Actually, local memory, which C treats as a large
- Xregister file.} references than Fortran. Overall, this appears to
- Xgive C a scalar performance advantage over Fortran. Compiling the
- XFortran loops with vectorization disabled shows C scalar performance
- Xto vary from 2 to 5 times faster than Fortran scalar performance.
- X
- X\section{Convex Performance}
- X
- XThe Convex compilers produce considerable documentation of their
- Xactions. Figure~3 shows part of the vectorization summary from the C
- Xcompiler for the file containing the loops. The Convex compiler
- Xrecognizes a wide range of language constructs, including goto-driven
- Xloops, as potential candidates for vectorization. It is able to
- Xrecognize in some cases that it should not vectorize a loop, and it is
- Xfrequently able to give good information about why a loop cannot be
- Xvectorized.
- X
- X\begin{figure}[ht]
- X\small
- X\caption{Vectorization using Convex C}
- X\begin{verbatim}
- X
- XSource Iter. Vector-
- XLine Var. Start Stop Step ization Reason
- X---------------------------------------------------------------------------
- X 42 k *EXPR* ipntp 2 None Not enough vector code
- X 53 k *EXPR* n 1 PART Vector (50%)
- X 99 ky *EXPR* n 1 FULL Vector
- X 149 l *EXPR* lp 1 None Unable to distribute
- X 167 ip *EXPR* n 1 PART Vector (92%)
- X 213 l *EXPR* lp 1 None Not trip counted
- X 266 k *EXPR* n 1 None Inner loop: exit
- X 329 l *EXPR* lp 1 None Inner loop: unknown trips
- X
- X\end{verbatim}
- X\end{figure}
- X
- XBoth the Convex C and Fortran compilers recognized the same loops as
- Xvectorizable, for the same reasons, except for loop 22, which only
- XFortran recognized, and loop 24, which only C recognized. The
- Xcompiler which recognized a loop as vectorizable produced much faster
- Xcode. If both compilers
- Xproduced scalar code or both produced vector code, the C compiler
- Xproduced more efficient code. The C compiler appears to do a better
- Xjob of managing register allocation and memory references and so
- Xproduces code with less memory bandwidth requirements.
- X
- X\section{Silicon Graphics Performance}
- X
- XThe Silicon Graphics Iris 4D uses a MIPS R2000\cite{Kane87} RISC
- Xprocessor chip set, with a floating point accelerator. RISC
- Xarchitecture machines depend heavily on compiler technology to
- Xgenerate efficient code in all cases, and it is not surprising that
- Xthe MIPS machine performs well. That Fortran did not fare as well as
- XC can be attributed to the primary marketplace of the Iris, which uses
- XC as its major programming language. It appears that more effort has
- Xgone into the C compiler than has gone into the Fortran compiler.
- X
- X\section{Sun 3 Performance}
- X
- XThe Sun 3 numbers show a classic example of a difference in code
- Xgeneration quality. Sun has spent a lot of time on the C compiler,
- Xwhich is the mainstay of their product line, and not as much time on the
- XFortran compiler. As a result, the C compiler produces more efficient
- Xcode than the Fortran compiler in all cases.
- X
- X\section{Conclusions}
- X
- XThese results demonstrate that on certain machines, using certain
- Xcompiler implementations, it is possible for a C compiler to produce
- Xmore efficient code than a Fortran compiler for a specific numerical
- Xapplication. Which compiler generates more efficient code appears to
- Xdepend more on the maturity of the compiler than on the nature of the
- Xmachine. From the way in which the loops were transliterated, it is
- Xsafe to say that it is possible to write C code which can be easily
- Xoptimized, just as it is possible to write Fortran code which can be
- Xeasily optimized.
- X
- XThis is not to say that C has all of the features which make Fortran a
- Xgood choice for numerical calculation. An informal poll of numerical
- Xprogrammers who are familiar with both Fortran and C produced a
- Xpartial list of features seen to be important which are not present in
- Xthe C language. The list included the integer exponentiation operator,
- Xautomatic dimensioning of parameter arrays, complex numbers and
- Xthe repeat count in formatted i/o. Dynamic memory allocation and
- Xpointers were viewed as an advantage of C over Fortran.
- X
- XCurrent compiler technology is making it more difficult to develop
- Xreasonable benchmarks which measure what they intend to measure.
- XAlthough the Livermore Loops have held up well over time, they are
- Xstarting to fall to optimizing compilers. In particular, the ability
- Xto detect invariance of the outer loop has led to invalid results
- Xbeing returned for particular loops.
- X
- XAs McMahon pointed out, performance can be as much a function of the
- Xmaturity of compiler technology as of the machine on which the code
- Xruns. This is as true for codes written in C as it is for Fortran.
- XIt is also true for the relative performance of various compilers from
- Xthe same vendor on a given machine.
- X
- XIt is possible to ``overoptimize.'' Vector operations include some
- Xoverhead, such as pipeline start up, which must be accounted for by
- Xthe optimizer. Concurrent operations also introduce overhead. Some
- Xcompilers are more sophisticated about this than others. The Convex
- Xcompiler will not vectorize a loop if it believes that there is
- Xinsufficient code to benefit.
- X
- XSome vendors appear to have paid more attention to vector optimization
- Xin their Fortran compilers than to scalar, although many of these
- Xloops could benefit from scalar optimization. This is most apparent
- Xin the difference between C and Fortran scalar performance on the Cray 2.
- X
- XIt is possible to write numerical codes in C and have them compiled
- Xinto efficient code which produces correct results. Any code which can
- Xbe written in Fortran can also be written in C to produce the same
- Xresults, and a good C compiler can generate efficient code from that
- Xsource.
- X
- X\bibliography{unix,misc}
- X\bibliographystyle{plain}
- X
- X\end{document}
- END_OF_FILE
- if test 28304 -ne `wc -c <'cloops.tex'`; then
- echo shar: \"'cloops.tex'\" unpacked with wrong size!
- fi
- # end of 'cloops.tex'
- fi
- echo shar: End of archive 3 \(of 3\).
- cp /dev/null ark3isdone
- MISSING=""
- for I in 1 2 3 ; do
- if test ! -f ark${I}isdone ; then
- MISSING="${MISSING} ${I}"
- fi
- done
- if test "${MISSING}" = "" ; then
- echo You have unpacked all 3 archives.
- rm -f ark[1-9]isdone
- else
- echo You still need to unpack the following archives:
- echo " " ${MISSING}
- fi
- ## End of shell archive.
- exit 0
-
- exit 0 # Just in case...
-